NSF PAR Search | NSF Public Access Repository



Accelerating Retrieval-Augmented Generation

https://doi.org/10.1145/3669940.3707264

Quinn, Derrick; Nouri, Mohammad; Patel, Neel; Salihu, John; Salemi, Alireza; Lee, Sukhan; Zamani, Hamed; Alian, Mohammad (March 2025, ACM)

An evolving solution to address hallucination and enhance accuracy in large language models (LLMs) is Retrieval-Augmented Generation (RAG), which involves augmenting LLMs with information retrieved from an external knowledge source, such as the web. This paper profiles several RAG execution pipelines and demystifies the complex interplay between their retrieval and generation phases. We demonstrate that while exact retrieval schemes are expensive, they can reduce inference time compared to approximate retrieval variants because an exact retrieval model can send a smaller but more accurate list of documents to the generative model while maintaining the same end-to-end accuracy. This observation motivates the acceleration of the exact nearest neighbor search for RAG. In this work, we design Intelligent Knowledge Store (IKS), a type-2 CXL device that implements a scale-out near-memory acceleration architecture with a novel cache-coherent interface between the host CPU and near-memory accelerators. IKS offers 13.4–27.9× faster exact nearest neighbor search over a 512GB vector database compared with executing the search on Intel Sapphire Rapids CPUs. This higher search performance translates to 1.7–26.3× lower end-to-end inference time for representative RAG applications. IKS is inherently a memory expander; its internal DRAM can be disaggregated and used for other applications running on the server to prevent DRAM, which is the most expensive component in today's servers, from being stranded.
Free, publicly-accessible full text available March 30, 2026
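
For readers unfamiliar with the term, the exact nearest neighbor search that the abstract above refers to can be illustrated as a brute-force scan that scores every stored vector against the query. The sketch below is only an illustration under stated assumptions: the function name `exact_top_k`, the toy corpus, and the choice of cosine similarity as the metric are hypothetical and are not taken from the paper, and the code says nothing about how IKS implements the search in hardware.

```python
import heapq
import math


def exact_top_k(query, corpus, k):
    """Score every vector in the corpus and return the k best (score, doc_id) pairs.

    Assumption: cosine similarity is the metric; the abstract does not name one.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    # Exhaustive scan: no index and no approximation, so the result is exact.
    scored = ((cosine(query, vec), doc_id) for doc_id, vec in corpus.items())
    return heapq.nlargest(k, scored)


# Hypothetical toy corpus of document embeddings.
corpus = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
    "doc_d": [0.0, 0.0, 1.0],
}

top = exact_top_k([1.0, 0.05, 0.0], corpus, k=2)
top_ids = [doc_id for _, doc_id in top]
print(top_ids)
```

Because every vector is visited, the cost of this scan grows linearly with the size of the database, which is the expense the abstract attributes to exact retrieval.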
On nonlinear stability of Muskat bubbles

https://doi.org/10.1016/j.matpur.2025.103664

Gancedo, Francisco; García-Juárez, Eduardo; Patel, Neel; Strain, Robert M (February 2025, Journal de Mathématiques Pures et Appliquées)

Free, publicly-accessible full text available February 1, 2026
Limitations of Stochastic Selection with Pairwise Independent Priors

Dughmi, Shaddin; Kalayci, Yusuf Hakan; Patel, Neel (June 2024, Proceedings of the 56th Annual ACM Symposium on Theory of Computing (STOC))
SmartDIMM: In-Memory Acceleration of Upper Layer Protocols

https://doi.org/10.1109/HPCA57654.2024.00032

Patel, Neel; Mamandipoor, Amin; Nouri, Mohammad; Alian, Mohammad (March 2024, IEEE)

Full Text Available
On Supermodular Contracts and Dense Subgraphs

https://doi.org/10.1137/1.9781611977912.6

Deo-Campo Vuong, Ramiro; Dughmi, Shaddin; Patel, Neel; Prasad, Aditya (January 2024, Proceedings of the 2024 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA))
Global Regularity for Gravity Unstable Muskat Bubbles

https://doi.org/10.1090/memo/1455

Gancedo, Francisco; García-Juárez, Eduardo; Patel, Neel; Strain, Robert (December 2023, Memoirs of the American Mathematical Society)

In this paper, we study the dynamics of fluids in porous media governed by Darcy’s law: the Muskat problem. We consider the setting of two immiscible fluids of different densities and viscosities under the influence of gravity in which one fluid is completely surrounded by the other. This setting is gravity unstable because along a portion of the interface, the denser fluid must be above the other. Surprisingly, even without capillarity, the circle-shaped bubble is a steady state solution moving with vertical constant velocity determined by the density jump between the fluids. Taking advantage of our discovery of this steady state, we are able to prove global in time existence and uniqueness of dynamic bubbles of nearly circular shapes under the influence of surface tension. We prove this global existence result for low regularity initial data. Moreover, we prove that these solutions are instantly analytic and decay exponentially fast in time to the circle.
Full Text Available
XFM: Accelerated Software-Defined Far Memory

https://doi.org/10.1145/3613424.3623776

Patel, Neel; Mamandipoor, Amin; Quinn, Derrick; Alian, Mohammad (October 2023, ACM)

Full Text Available
Clinical utility of mesenchymal stem/stromal cells in regenerative medicine and cellular therapy

Maldonado, Vitali; Patel, Neel; Smith, Emma; Barnes, C. Lowry; Gustafson, Michael P; Rao, Raj R; Samsonraj, Rebekah M. (July 2023, Journal of Biological Engineering)

Full Text Available
On Sparsification of Stochastic Packing Problems

https://doi.org/10.4230/LIPIcs.ICALP.2023.51

Dughmi, Shaddin; Kalayci, Yusuf Hakan; Patel, Neel (January 2023, Schloss Dagstuhl – Leibniz-Zentrum für Informatik)
Etessami, Kousha; Feige, Uriel; Puppis, Gabriele (Ed.)
Motivated by recent progress on stochastic matching with few queries, we embark on a systematic study of the sparsification of stochastic packing problems more generally. Specifically, we consider packing problems where elements are independently active with a given probability p, and ask whether one can (non-adaptively) compute a "sparse" set of elements guaranteed to contain an approximately optimal solution to the realized (active) subproblem. We seek structural and algorithmic results of broad applicability to such problems. Our focus is on computing sparse sets containing on the order of d feasible solutions to the packing problem, where d is linear or at most polynomial in 1/p. Crucially, we require d to be independent of the number of elements, or any parameter related to the "size" of the packing problem. We refer to d as the "degree" of the sparsifier, as is consistent with graph theoretic degree in the special case of matching. First, we exhibit a generic sparsifier of degree 1/p based on contention resolution. This sparsifier’s approximation ratio matches the best contention resolution scheme (CRS) for any packing problem for additive objectives, and approximately matches the best monotone CRS for submodular objectives. Second, we embark on outperforming this generic sparsifier for additive optimization over matroids and their intersections, as well as weighted matching. These improved sparsifiers feature different algorithmic and analytic approaches, and have degree linear in 1/p. In the case of a single matroid, our sparsifier tends to the optimal solution. In the case of weighted matching, we combine our contention-resolution-based sparsifier with technical approaches of prior work to improve the state-of-the-art ratio from 0.501 to 0.536. Third, we examine packing problems with submodular objectives. We show that even the simplest such problems do not admit sparsifiers approaching optimality. We then outperform our generic sparsifier for some special cases with submodular objectives.
Profiling gem5 Simulator

https://doi.org/10.1109/ISPASS57527.2023.00019

Umeike, Johnson; Patel, Neel; Manley, Alex; Mamandipoor, Amin; Yun, Heechul; Alian, Mohammad (April 2023, IEEE)

In this work, we set out to find the answers to the following questions: (1) Where are the bottlenecks in a state-of-the-art architectural simulator? (2) How much faster can architectural simulations run by tuning system configurations? (3) What are the opportunities in accelerating software simulation using hardware accelerators? We choose gem5 as the representative architectural simulator, run several simulations with various configurations, perform a detailed architectural analysis of the gem5 source code on different server platforms, tune both system and architectural settings for running simulations, and discuss the future opportunities in accelerating gem5 as an important application. Our detailed profiling of gem5 reveals that its performance is extremely sensitive to the size of the L1 cache. Our experimental results show that a RISC-V core with 32KB data and instruction caches improves gem5’s simulation speed by 31%–61% compared with a baseline core with 8KB L1 caches. Our paper is the first step toward building specialized hardware and software environments for accelerating software-based simulators.
